Matei Zaharia is a co-founder of Databricks, a Stanford professor, and the creator of Apache Spark, one of the most widely used frameworks for large-scale data processing. In his current role as Chief Technologist at Databricks, Matei oversees various data management, AI, and LLM projects, including Dolly, an open-source LLM bringing ChatGPT-style capabilities to the enterprise.
As the Co-founder and CEO of Alation, Satyen lives his passion of empowering a curious and rational world by fundamentally improving the way data consumers, creators, and stewards find, understand, and trust data. Industry insiders call him a visionary entrepreneur. Those who meet him call him warm and down-to-earth. His kids call him “Dad.”
Producer: (00:01) Hello and welcome to Data Radicals. In today's episode, Satyen sits down with Matei Zaharia. Matei is an open source trailblazer and the mastermind behind Apache Spark, one of the most widely used frameworks for distributed data processing. Today, he oversees various data management and machine learning projects at Databricks and Stanford University. In this episode, Matei dives into the Databricks and Alation partnership, which helps companies own their data, and into the democratization of open-source large language models.
Producer: (00:32) This podcast is brought to you by Alation. Subscribe to our Radicals Rundown newsletter. You'll get monthly updates on hot jobs worth exploring, news we're following, and books we love. Connect with past guests and the wider Data Radicals community. Go to alation.com/podcast and enter your email to join the list. We can't wait to connect.
Satyen Sangani: (00:56) Today on Data Radicals we have Matei Zaharia. Matei is the co-founder and chief technologist at Databricks and an assistant professor of computer science at Stanford. During his PhD at UC Berkeley, he created Apache Spark and has contributed to other popular data and machine learning software such as MLflow, Delta Lake, and Apache Mesos. He has received various prestigious awards for his research, including the 2014 ACM Doctoral Dissertation Award, the NSF CAREER Award, and the U.S. Presidential Early Career Award for Scientists and Engineers. We're super excited to have you. Matei, welcome to Data Radicals.
Matei Zaharia: (01:30) Thanks so much for having me. Very excited to be here.
Satyen Sangani: (01:33) So let's start with Spark, because in some ways your public story started there. You created the Spark Project and now that's become probably the most widely used framework for distributed data processing. Tell us the story behind it and how it came to be.
Matei Zaharia: (01:46) Yeah, for sure. So I started working on that during my PhD at UC Berkeley, which I began in 2007. I came into it just interested in computer science in general, and I saw that what I thought were the most interesting computer applications were these things happening in the large tech companies, things like search engines and social networks.
And they all involve processing large amounts of data, basically everything on the web or even internal data generated by servers and machines there. And all these companies had internal systems to do it, things like MapReduce at Google, with nothing really like them outside. And I thought data is actually not that expensive to collect and store, it's actually getting less expensive, and it's very likely that more people will want to work on these super large data sets.
Matei Zaharia: (02:41) And whether in science or in industry, different types of companies, or really anyone who has lots of users and is trying to analyze what the users are doing and improve their products. So I really wanted to figure out how these technologies work and whether there's any way to democratize them and bring this kind of large-scale data processing to more folks.
So initially, I started working with some users of these technologies, of MapReduce basically, and learned a little bit about the problems. And I also realized early on that as soon as an organization set up basically a data lake, lots of data, and started collecting things, they wanted to do a lot more. They wanted to do traditional SQL, because that's what every data worker knows. They wanted to do interactive queries, and they also wanted to do more sophisticated algorithms like machine learning, which wasn't easy to support with MapReduce.
Matei Zaharia: (03:34) So it quickly went from “How do you get this kind of MapReduce stack in front of everyone?” to “How do you actually go beyond that and try to figure out additional use cases?” And so I started Spark as a project mainly focused on some of these new use cases, on interactive queries and on machine learning at the beginning, and then kind of backed into also doing really well at large-scale bulk data processing. There was really not much in open source that had that broad scope and that had the performance of Spark at that time. And so it sort of grew into this community. Of course, as the community grew and the project got a lot better, we got contributors from many places. We eventually decided to start a company that would spend quite a bit of its time at the beginning improving the project and so on.
Satyen Sangani: (04:19) So you mentioned Hadoop, which obviously was sort of the predecessor. And in some ways you think about Cloudera, the name Cloudera comes from this idea that they would've delivered Hadoop in the cloud, although at the time that wasn't really a real business model for them or didn't become one. You then follow up with Spark. Do you think of Spark as a singular innovation or do you think of it as multiple improvements to basically deliver something that maybe the promise of Hadoop didn't deliver? How do you conceive of it?
Matei Zaharia: (04:47) It was basically like one idea from the beginning, but maybe you can break it down into two pieces. So one thing that Spark did that you just couldn't do with the stack before is, it was meant to be this unified engine where you can run different types of computations and combine them into one flow and actually get really good performance optimization across that flow.
And before, in the Hadoop world, the idea from all the Hadoop vendors, if you looked at their websites, was that there are many different open source engines you need to set up and install. Hadoop was a stack: there was Hadoop MapReduce, which you used for certain stuff; there were various things for SQL, other things for machine learning, yet other things for streaming. And that is, of course, extremely complex, because you've got all these different engines with slightly different interfaces, different versions of the SQL language, and so on, that you have to hook together into an application.
Matei Zaharia: (05:47) So in Spark, we kind of went the other way and said, “Look, all of these are ultimately running some computation in parallel. Can we just have a single engine that does it?” And no one was really trying that. So that in itself made it simpler and made it more powerful, because you got all these new algorithms we didn't anticipate that people built.
Just as an example, Spark is known for supporting streaming. But before we built streaming, one of our early users — one of the startups in the Bay Area that was using it — called me up and showed me. They said, "We have this really cool application. I don't know if the engine's supposed to be used this way, which is we keep loading new data every second and updating the state and like letting users query it.” And we looked at it and thought, “Wow, that's streaming.”
Matei Zaharia: (06:30) Then they were saying, “The only downside is after we run this for a few days it crashes. There's some stuff you're not cleaning up after each computation.” And we're like, “Yeah, we never imagined someone would just run this for indefinitely long periods, but don't worry, we can fix that.” So there were things like this, some of the machine learning libraries as well, that came in and no one had tried. And related to that, for software developers, the secret of software development is that it's very often gluing things together. You don't want to write your own algorithms from scratch; you just want to find things on the web, import them into your project, and glue them together.
Matei Zaharia: (07:08) And the unified engine, and also the interfaces we chose for it, like Python, which made it really friendly for users, enabled that. Whereas before, in the Hadoop world, you couldn't just grab a machine learning algorithm from the web and attach it to your SQL pipeline and attach that to a streaming thing in three lines of code. You had to maybe set up three different systems and some kind of infrastructure to orchestrate all of them. So just this idea of “the system supports libraries,” where different people can write them and then a user can bring them into the same program and they actually interoperate easily, was kind of new to that world at the time.
And that became one of the big things we invested in. If you ever look at our talks early on, a lot of them are about libraries for Spark. Although it's kind of fun, maybe, to implement your favorite graph algorithm or machine learning algorithm on top of Map and Reduce, you probably don't want to do that as a developer; you want to grab one that someone else did. So I'd say the unified engine plus the resulting ecosystem was the difference.
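To make that gluing concrete, here is a minimal PySpark sketch that combines SQL, the DataFrame API, and an MLlib algorithm in a single program. The `events` table and its columns are hypothetical; this is an illustration of the pattern, not code from the conversation.

```python
# A minimal sketch of the "unified engine" idea: SQL, the DataFrame API,
# and an MLlib algorithm glued together in one Spark program. The `events`
# table and its columns are hypothetical.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("unified-demo").getOrCreate()

# 1. SQL: aggregate raw events into per-user features.
features = spark.sql("""
    SELECT user_id,
           COUNT(*)             AS num_events,
           AVG(session_seconds) AS avg_session
    FROM events
    GROUP BY user_id
""")

# 2. DataFrame API: pack the feature columns into the vector MLlib expects.
assembled = VectorAssembler(
    inputCols=["num_events", "avg_session"],
    outputCol="features",
).transform(features)

# 3. MLlib: cluster users. The same engine plans and runs all three steps.
model = KMeans(k=5, featuresCol="features").fit(assembled)
model.transform(assembled).select("user_id", "prediction").show()
```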
Satyen Sangani: (08:11) In the early days, in the first couple of releases, what were you competing with at that point in time? Was it directly competing with Hadoop? Was this simply a better alternative to Hadoop that you were positioning at the time? Or was there some other competitive technology in the mindset of the buyer?
Matei Zaharia: (08:25) So honestly, with Spark, a lot of the initial goal was to broaden the audience of people who can work with these large datasets that companies were accumulating. A company might set up Hadoop for one use case that they consider important. Let's say we get all the event logs of what people did on our retail site, we also bring in some data sets from the web, and we build a recommendation engine; we do some analysis. But once you have all that data in one place, there are many other things you can do with it, and Hadoop required pretty advanced software engineers to do that stuff. So a lot of the initial goal was, can we get other users who would never have used Hadoop, like a data scientist or even a business analyst who's just using SQL and BI tools, to work with these large data sets.
Matei Zaharia: (09:16) So in that sense, it wasn't really competing, it was kind of complementing. Of course, over time people thought, “Hey, this is pretty nice. It's good for really quick data science stuff. Can I also use it to write my big pipelines that took me like six months to build with lots of Java development with Hadoop?” And so it started moving into that space.
But also, with Spark, I think most people view it as something you can run on Hadoop, or you can just run it on the cloud directly on stuff like Amazon S3. And for us it was more about broadening the market, bringing not exactly the same kind of product but something like that to more people, and getting them excited about what they can do with data science and machine learning and analytics at scale.
Satyen Sangani: (10:00) Were there particularly gnarly, complicated technical problems that you had to solve in the early days? Or did you feel like much of what you were doing was unifying various bits of infrastructure or various bits of compute logic that were already discovered, various algorithms and libraries that otherwise weren't supported?
Matei Zaharia: (10:15) First of all, I think there is a big design challenge to make these things work smoothly with each other. Just as an example, something we did pretty early on — actually, the Spark project was open sourced in 2010 and then around 2013, 2014, we were working on this — is we updated a lot of the internals of Spark basically to look like a SQL database engine. So we updated the system to be able to do query planning the same way your database would. And we changed a lot of the libraries, the machine learning libraries, the graph ones, and so on to use SQL operators like JOINs and SELECTs and GROUP BYs. And that allows us to get a powerful kind of query optimization across these applications. And also to keep improving their performance of all the libraries as we made new releases of the engine without those people having to go back and change their code in some way.
Matei Zaharia: (11:07) The way you use that in Spark now is data frames. Data frames are something you can use in Python that basically translates into SQL. Of course, you can also use SQL directly. So that, I think, wasn't obvious at the beginning. It wasn't clear that you could build these other algorithms on top of these operators and optimize across them.
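As a small illustration of that point, the PySpark sketch below shows DataFrame code and the equivalent SQL going through the same query optimizer. The `sales` table and its columns are hypothetical.

```python
# DataFrame code and SQL go through the same query optimizer, so the two
# queries below produce equivalent optimized plans. The `sales` table and
# its columns are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# DataFrame API version...
by_product = (
    spark.table("sales")
    .filter(F.col("country") == "US")
    .groupBy("product")
    .count()
)

# ...and the equivalent SQL version.
by_product_sql = spark.sql("""
    SELECT product, COUNT(*) AS count
    FROM sales
    WHERE country = 'US'
    GROUP BY product
""")

# Compare the optimized plans each call prints.
by_product.explain(extended=True)
by_product_sql.explain(extended=True)
```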
And then beyond that design aspect, I would say, yeah, there are a lot of technical challenges with making something like this really work in a foolproof, easy-to-use way for everyone. So you can stand up like thousands of jobs, you can have thousands of people that are using the system each day and asking us to do weird things and it doesn't crash, it doesn't go slow, and so on. That kind of stuff you can work through as you get users. You kind of figure out, hey, what are the top problems they're running into? How do we fix them? And it's this ongoing sort of engineering effort.
Satyen Sangani: (11:56) It's interesting, because if you look at the story of Databricks on the face of it, you see these fabulously intelligent, brilliant computer scientists, folks who could likely do anything, you obviously being at the forefront of that. And yet as you narrate your story, you talk a lot about focusing on user problems and being doggedly persistent about solving them. I expect many of them would've been quite unsexy to address and actually quite tedious. And so the inside story sounds a little bit different from the outside story. Is that a fair reflection of what you feel is the reality of how the business got built?
Matei Zaharia: (12:30) Yeah. As I said, I was really interested in sort of democratizing this kind of technology, helping everyone take advantage of it instead of having just a couple of companies that can really do things with all the data out there. And a big part of computer science, and especially computer systems, which is sort of the main field I'm in, is figuring out interfaces. It's basically human-computer interaction in a sense: how should programmers work with a thing? How should even end users work with it? And how do you make it easy for them to do what they want without worrying about some of the hard technical problems? Pack those in a box and let them use it.
Matei Zaharia: (13:09) So personally, I was always really interested in, can people actually use my thing? I wasn't so interested in, do I have something clever that looks good on paper, that other academics are impressed by, but that maybe doesn't really solve the problem. And whenever I talk to someone and I realize, hey, their problem is slightly different from what I thought, I'm always super happy. Even if it seems less interesting, if it's a problem for people, it usually means it's difficult, and it's good to really think about it and figure out why.
Satyen Sangani: (13:39) When did you realize in your academic work that Spark could become a commercial endeavor? And how did that realization happen?
Matei Zaharia: (13:44) We definitely didn't set out to commercialize it and to start a company, but about two or three years in, we did see a lot of organizations picking it up and using it. And one really weird thing, kind of a slightly funny story: at the beginning we went to the vendors in this space, the Hadoop ones and other large tech companies, and we tried to get them to use Spark in their offerings, because it's just something you can use in your stack. And none of them really wanted to, because they felt like they had to really own their technology. They felt like, oh, maybe if UC Berkeley stops developing this, we'll be stuck; we'll give our customers something that isn't really future-proof, and so on.
Matei Zaharia: (14:24) So it was actually pretty hard to convince them. Of course, we were excited about starting a company, and we thought that with the rise of cloud computing it was a good chance to create a new data platform company, 'cause everyone was re-platforming anyway to move to the cloud. But we also thought that, hey, to really understand this space and have an impact, we need to have a company, so that there's something enterprises can trust to build and maintain the software long term.
Satyen Sangani: (14:50) Yeah. And the company, obviously fabulously successful today, it's been roughly a decade since the founding?
Matei Zaharia: (14:56) Yeah. Pretty much. Yeah.
Satyen Sangani: (14:57) How long did it take from the point of founding to the point where you knew this was something significant and big? When was it obvious and clear that this was an unmitigated and runaway success? Was that year one, year three, year five?
Matei Zaharia: (15:09) There are always stages to it. You're never really done, and there's always more you can do. But I would say after about two or three years, that's when we hired a great head of sales and had our product working. In the first year we were just building the first cloud version of the product and getting feedback from all the users. Then we saw that the sales team actually hit a stride and was able to repeatedly get people to try the software, and they were growing a lot year on year and were successful with it. So that's when we thought, okay, we have something repeatable here. It's not just a one-off where, if the co-founders are heavily involved with the customer, they'll try our stuff because they feel like they're getting a lot of attention.
Matei Zaharia: (15:55) But there are always levels to that. I mean there's always like, when do you get the first seven figure deal? When do you get something larger than that?
We were also at the beginning of the cloud, and our strategy was to build just on the cloud initially. We always said, if major parts of the industry aren't shifting to cloud, then maybe we'll go back and do an on-prem thing. But that didn't really happen. For example, we did this tour of a bunch of banks in New York, I think in 2015, and they all said, "We'll never be on the cloud." And then a few years later, when the first of those were actually moving there, those were kind of big milestones for us.
Satyen Sangani: (16:33) Yeah, I can imagine. And it's very similar to our journey. You now describe yourself as a lakehouse company. Tell us what a lakehouse is.
Matei Zaharia: (16:42) Yeah, so a lakehouse is a unified data management system that basically combines the benefits of data lakes and data warehouses. So it's very related to Spark being a unified engine, but now for data management as a whole, right? Spark itself is only a computing engine, whereas Delta Lake is the technology we have for data management, for actually storing data and tracking different versions and doing transactions and so on. And then there's Delta Lake plus Unity Catalog, which is sort of our operational catalog that you guys know well and work with. That's the metadata piece of it.
Matei Zaharia: (17:19) The idea behind lakehouse is, again, much like the idea behind Spark, we think it's much easier for organizations to work with one platform that can span the sort of large scale, maybe unstructured data you'd have in your data lake with the capabilities you get in a warehouse that include really high performance, caching, indexing, transactions, multiple versions, data sharing, stuff like that.
Matei Zaharia: (17:44) So we're designing this kind of unified system that does both. And apart from the fact that you can have a single table or a single data set that you can use directly in, say, large-scale ML and in business intelligence and dbt and so on, the other thing you get from this is that we built it all on open interfaces and open data formats, which is the thing we inherit from the data lake world.
So part of it is figuring out how you get super high performance, powerful management features, and powerful governance features on top of data in these open formats, which historically have been more just something in the engineering department, hard to bring lots of users into.
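A small sketch of that single-table idea, assuming a Spark session with the open-source Delta Lake package configured; the path and table name are made up. One Delta table, in an open format, serves both SQL/BI queries and data-science reads, with transactions and versioning underneath.

```python
# One Delta table, written once in an open format, read by both SQL/BI
# and data-science code. Assumes Delta Lake is configured on the session;
# the path and table name are made up.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Write an open-format Delta table with ACID transactions.
(spark.range(1000)
    .withColumnRenamed("id", "order_id")
    .write.format("delta").mode("overwrite").save("/data/orders"))

spark.sql("CREATE TABLE IF NOT EXISTS orders USING DELTA LOCATION '/data/orders'")

# The same table serves BI-style SQL...
spark.sql("SELECT COUNT(*) AS n FROM orders").show()

# ...and data-science reads, including time travel to earlier versions.
current = spark.read.format("delta").load("/data/orders")
version_zero = (spark.read.format("delta")
                .option("versionAsOf", 0)
                .load("/data/orders"))
```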
Satyen Sangani: (18:27) Yeah, it's super interesting, because if you think about a traditional relational database, all three of those elements would've existed, but as a bundled offering. So you would have the storage implicit in the schema; you'd obviously have compute and the compute engine; and of course you'd have the underlying catalog with all of the metadata, which could be both technical and, in some cases, even sort of business metadata, as people would describe it. And you're breaking all these things apart.
And of course, the challenge with that strategy is now you've got sort of competition on multiple vectors and especially in the world of open source, I mean Delta Lake is a standard. There are multiple standards that it could compete with.
Unity Catalog, is there a competitive offering to that that you see out there? I guess all of the cloud hyperscalers have sort of an equivalent, but it does mean that you've gotta innovate on multiple vectors and compete on multiple vectors. How do you think about that challenge of integrated offering versus standalone success? If one has to pick, how do you pick?
Matei Zaharia: (19:16) Well, definitely enterprises are looking for an integrated simple offering. If you have a product with sort of open interfaces, open access, of course there are many ways people can use it. But when people ask us like, "Hey, I just wanna get work done, how do I just set it up and make everything work well?" We have this sort of recommended way of doing things where we'll make sure you have really great features and so on. And the nice thing with that is if you do have a business unit that say, is using a different compute engine or a different storage format or whatever, we can still connect to it and work with it. And we'll build features for example to let you have access control and auditing and all these kinds of features over that.
Matei Zaharia: (19:56) So we're sort of open to those, and you can bring those in. But if you just wanna set up a group and tell them, "Hey, here's what you do," so you don't have to worry about issues and you're always on kind of the best stack that Databricks will work with, we recommend that.
One of the early stories about open source is the cathedral and the bazaar [from Eric S. Raymond's essay]. So the cathedral is the thing that's all designed by one person, maybe; it's extremely coherent and so on, but it also kind of takes forever to build.
Matei Zaharia: (20:27) And when you go there, there's one message you're hearing. And then the bazaar is the open thing: you don't know who's gonna show up each day, but there'll be some really interesting goods and things that you just wouldn't see anywhere else. We do wanna give people a simple, unified way: if you just wanna get started and get stuff done, follow the defaults in the product and it'll work. But we wanna be open to some of that innovation and let people bring that in.
Producer 2: (20:51) Do you feel like you have to do one or the other? Or do you think you're building a sort of bazaar inside of a cathedral?
Matei Zaharia: (20:57) Yeah, it's definitely a little bit of both. And we're also trying to figure out what kinds of things we can do to make it easier to interoperate and bring in things from the bazaar while having, for example, strong governance.
So for example, in machine learning, every day there's a new machine learning framework you can run, a new model, all these things that require you to basically log into some machines and run some scripts. We're trying to figure out how to do that while having strong data governance properties, strong audit trails, and an easy UI. There is a little bit of, as you're going through the bazaar, can we kind of guide you, or maybe protect you from some of the scarier things out there?
Satyen Sangani: (21:41) Yeah, yeah. And therefore you have to have sort of opinionated integrations and thoughtfulness around this. It's a pretty interesting thing.
When I was at Oracle, where I kind of professionally grew up, I was actually on the app side, and there was of course the database side, and the database side controlled the company. There was a running joke on the app side, which is: if you waited long enough, the database would do it. And that seems to describe your story a little bit, literally from day one, where on some level it's like, we're gonna support a set of libraries, we're gonna support different versions of compute, and that's going to be an ever-expanding pie. And now you've gotten into lakehouses.
Do you see this going even further, into the world of transactional processing, say? And how do you describe the mission versus the vision, what's near and what's far?
Matei Zaharia: (22:23) So overall, we are really focused on what we can build well and what makes sense for an integrated data and AI platform. I'll mention some things we're not building: for example, we don't build our own machine learning algorithms and frameworks. We don't have a competitor to, say, PyTorch or scikit-learn or stuff like that. We let people bring the external ones from outside, and we just try to make sure that they work well with the other things you want from an enterprise platform, like cost management, collaboration, access control, and so on.
As for the things we did build: we started out with the engine because we think that's something valuable; it directly translates into cost, performance, usability, and efficiency for users. And then we added the data management layer with Delta Lake and Unity because it was a big problem for people.
Matei Zaharia: (23:18) They said, we love your engine, but we really need you to integrate data management and make it easy to do these things. And I don't think we'll add too many major new things. You mentioned transactional data; definitely we have good integrations, and we're working on even more, for bringing in data from transactional systems as events happen and also pushing stuff out so you can serve it.
Matei Zaharia: (23:43) And we also have model serving for machine learning which you can then plug into these applications. So there might be some more infrastructure there, but we're not going to go into an area unless we think we can provide real value there. And unless we think it's something that companies really want to integrate with their data and AI sort of pipelines.
Satyen Sangani: (24:04) Yeah. You mentioned Unity Catalog. Tell us a little bit about that offering and just describe what it is, why you built it and why you launched it when you did. Maybe bring us forward towards what the roadmap would be for it.
Matei Zaharia: (24:12) So Unity Catalog is this unified catalog of all the assets you have in Databricks. It's mostly operational but also has discovery features and some governance features. I think the really unique thing about it is that it's one place where you can see not just tables and views, which are your classic data catalog things, but stuff like machine learning models, streaming datasets, dashboards, notebooks, other types of assets you're working with in data, basically anything you can build on our platform. And it's kind of a simple idea, but again, historically you've got these things from different vendors and different platforms. So for example, if you're using, say, just Amazon Web Services, but really any of the cloud vendors, they all have a machine learning platform where you can train models and stuff.
Matei Zaharia: (25:03) But how do you get row-level security in that? You can't, 'cause the ML platform just thinks of the world as a bunch of files. It's literally: put some data into an S3 bucket and then run this thing on it. They all have a BI kind of offering, but can you go from a table in your database and trace through the lineage and see all the dashboards that came out of it? It's not super easy; you have to do a lot of work to put those together.
So basically we just wanna have these things in one system, on our platform, in one language. Of course, it's still only for things on our platform, things you can work with in it, but it does make it easier to manage those specific things. And the things we're focusing on now, apart from just giving you a uniform interface to all of this, are some nice cross-cutting features like access control, which you can do the same way across everything; tagging, again the same system of tags on all your stuff in Databricks; and then sharing and lineage. So these are kind of the main things we'll do.
Satyen Sangani: (26:04) Yeah. And all of those things are things that we're excited to integrate with. But you mentioned one in particular, access control, and this morning we all saw the announcement of Databricks buying Okera. Tell us a little bit about that, why you did it, and why now.
Matei Zaharia: (26:18) Yeah, definitely. So we see customers wanting more and more advanced access control features, and I would say there are two aspects of access control.
So one is more the operational part: what can you do efficiently in the engine, right? People can write a policy asking for all kinds of stuff, but if that means that each time you read a record you have to join with another table and look things up, it really slows things down. So part of the reason we are building attribute-based access control into our platform is to be able to integrate it well with the engine and do that stuff efficiently. The same thing applies to the machine learning piece, right? If you have an untrusted ML library and we want to filter data very quickly and feed out only data that's secure, how can we do that efficiently? That requires a little bit of design of the way we run that thing and the way the rest of the system works.
Matei Zaharia: (27:11) The other aspect of it is the policy authoring and user interface aspect. And honestly, most companies will have many data platforms, not just Databricks. They'll have all kinds of stuff. We have our offering within our platform, but we also want external, higher-level platforms like Alation to plug into that.
And the cool thing with us building the operational bits into the engine is that these platforms, which before were limited to the more basic types of access control you could do in Databricks, suddenly can do more advanced things, like attribute-based or more complicated policies. There were quite a few companies putting plugins into Spark and things like that to implement their policies; hopefully they can just have that part done by us and focus on the authoring, the sync across different systems, and really great interfaces for users in the enterprise. Okera will help with both of those, but primarily, I would say, with the operational piece; they've had very efficient ABAC [attribute-based access control] for a while.
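For a flavor of what pushing a policy into the engine can look like, here is a hedged sketch using Unity Catalog's SQL row-filter mechanism on Databricks, where a `spark` session is predefined in notebooks. The catalog, schema, table, function, and group names are all hypothetical, and this shows one possible shape of such a policy rather than Okera's implementation.

```python
# A hedged sketch of attribute-based row filtering pushed into the engine,
# using Unity Catalog's row-filter SQL on Databricks. All catalog, schema,
# table, function, and group names are hypothetical.

# Policy function: members of `admins` see every row; everyone else only
# sees US rows.
spark.sql("""
    CREATE OR REPLACE FUNCTION main.policies.region_filter(region STRING)
    RETURN is_account_group_member('admins') OR region = 'US'
""")

# Attach the policy to the table. The engine now applies it at scan time,
# instead of joining against a policy table for every record read.
spark.sql("""
    ALTER TABLE main.sales.orders
    SET ROW FILTER main.policies.region_filter ON (region)
""")
```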
Satyen Sangani: (28:12) Yeah. They deeply understand the problem. And it strikes me that there obviously have been lots of companies, Okera being one of them, that were founded with this premise of cross-platform, heterogeneous access rights management. But it's a terribly difficult problem, because the compute and the internals have to be optimized for how those access rights are provisioned. And so by building it into... I'm gonna broadly call you a database. By building it into the database, you obviously can more elegantly optimize and deliver the user experience. I think it makes total sense.
And from our perspective, we just want to give users the ability to find the data that they need. So if you can tell us what they can access, that makes all the sense in the world. But there are also other gnarly problems, like lineage, which obviously is quite complicated, particularly in the context of files and Spark, where there are very complicated jobs. Tell us how you're innovating there, and tell us the bounds of the problem that you're trying to solve on that score.
Matei Zaharia: (29:04) We are building lineage tracking throughout our platform, so all the computations on Databricks. And one of the cool things we're doing there is, because we did bet on this unified engine that can do the different workloads, like streaming, ML, SQL, and so on, we can actually implement lineage within the engine that tracks dependencies at the level of fields, at the level of columns, basically; that's what we've done so far. So we instrumented our engine to give you column-level lineage. And pretty much anything you do, whether it's in data science notebooks, in ETL jobs, or in SQL, even if you're submitting the SQL from an external tool, we can get this fine-grained lineage on.
Matei Zaharia: (29:49) So that's basically what we've built. It's very efficient, pretty much no overhead to run, and you get the data within seconds. And you can actually query the lineage as a table, so you can even write automated jobs that look at what's derived from this, like what are the most delayed upstream data sources, or stuff like that. We think it'll also be awesome for tools like Alation to plug into, to analyze that data and show people insights about it. And yeah, it's basically just for things in our platform; for external stuff, we just track where we read from before it came into our engine.
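Because the lineage is queryable as a table, a job like "what's downstream of this source" is just SQL. Here is a hedged sketch against the Databricks system tables; the `system.access.table_lineage` name and columns follow the documented schema, but treat them as assumptions, and the source table name is made up.

```python
# Querying lineage as a table: everything that read from a given source
# table in the last 7 days. Table and column names follow the documented
# system-tables schema but should be treated as assumptions.
downstream = spark.sql("""
    SELECT DISTINCT target_table_full_name
    FROM system.access.table_lineage
    WHERE source_table_full_name = 'main.sales.orders'
      AND event_time >= current_timestamp() - INTERVAL 7 DAYS
      AND target_table_full_name IS NOT NULL
""")
downstream.show()
```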
Satyen Sangani: (30:24) I love the fact that you've made the design decision to expose it as a table and to enable people to flexibly and declaratively ask what they want from it, because lineage is a complicated problem. You don't know what people are looking for; people themselves don't know what they're looking for.
We were similarly inspired by Jira, and also to an extent by the JQL query language, because the same thing is true at another layer of abstraction. When people are doing data stewardship, it's like: I want tables that are larger than X terabytes, that were produced by Matei, that were built in this country. And you're like, okay, well, that's a query; how do I go get it for people? So I think this idea of flexibility at the data management level is really exciting.
You guys invested in Alation in our last round, and we were obviously tremendously excited about that. It felt like a very natural evolution, because I think in some sense both companies share this ethos of simplifying data in different ways, but simplifying data for people. Where do you see the partnership going forward? Obviously you're innovating on the vector of Unity Catalog and in many other domains. How do you see this part of the bazaar, if you will, opening up?
Matei Zaharia: (31:29) Yeah, I think they're very complementary. As I said, one way to view us is that we want to build the database of the future for analytics. We think the database of the future needs AI as a first-class capability; it'll power AI apps. Obviously it needs data, it needs a great engine, it needs great management. But in the grand scheme of an enterprise, it is still one system. We think it's the right system for analytics in many cases, but you'll never have a large enterprise that migrates everything overnight to one system.
And on top of that database of the future, you still want a great experience for users; you want discovery, you want management. And those are problems where you inherently want a platform that works across everything you have in your enterprise, not just Databricks.
Matei Zaharia: (32:18) So it's very complementary. We're only in this part of the pie, where there are a lot of big problems in making an awesome system for people to do large-scale data processing and BI and all that stuff together. Before people even get to that, there's the question of how they find the data. And then for the CIOs, the people who have to manage everything happening, there's the question of, okay, how do I track across the hundreds of databases and vendors I have in my company? How do I track what's going on?
Matei Zaharia: (32:57) And I think Unity Catalog takes this world of the lakehouse, which initially was a little bit more in the realm of just engineering and data science, and makes it much easier to build these kinds of advanced management and discovery features over it and really open it up to more users. So we're really excited about that, to bring it in front of every enterprise user, basically.
Satyen Sangani: (33:13) Yeah, no, we are too. And you've mentioned AI a couple of times, and obviously a large part of building a company is problem selection: what am I gonna work on? You mentioned you're not gonna work on transaction processing, 'cause that's kind of yesterday's problem. You recently announced Dolly 2.0. So tell us a little bit about what that is and why you built it.
Matei Zaharia: (33:30) I'm actually wearing my Dolly T-shirt. I don't know if you can see it, but... [laughter]
Satyen Sangani: (33:34) With sunglasses.
Matei Zaharia: (33:36) Yes. We're giving these out at our conference in June, actually. So everyone has seen in the past few months that large language models have gotten very good, and you can actually build really powerful conversational interfaces with them. ChatGPT really showed that to the world. And kind of strangely, even before that, GPT-3 and 3.5 could do this stuff. It's just that no one had [laughter] put them on a webpage in front of any user to talk with. So it was pretty interesting.
Satyen Sangani: (34:06) It's almost exactly what you did with Databricks, right? On some level, it was the interface that simplified and built people's imaginations about what was possible.
Matei Zaharia: (34:13) That's absolutely true. Yeah, that's exactly it. Even for myself, in my research group at Stanford, we were using GPT-3 and other stuff for some projects, and we were like, okay, this is kind of cool, we can do this. But it wasn't so easy to just say, can it help me write a tweet? Can it help me write a program for this thing I'm working on? Whatever. And it became so easy to try with the chat interface.
Matei Zaharia: (34:35) This has gotten a lot of companies excited about AI with language specifically. Of course, AI is also getting good at other things; there are many companies working on computer vision and traditional predictive analytics as well. But I think the language stuff is especially exciting for two reasons.
Matei Zaharia: (34:55) First, I think it can dramatically improve interfaces to all software. Any kind of human-computer interaction, you could imagine making better through some conversational elements, or at the very least through smarter semantic search or stuff like that. So everyone's thinking about it.
And the other one is, if you have lots of text data, just a bunch of PDFs or documents or something, suddenly you have a pretty powerful tool to analyze it in bulk. So you might want your data warehouse, your SQL engine, to be able to read text documents and answer questions about them that are sort of ill-posed. Not just "string contains 'return'", but, hey, is this message about a product return even though they didn't use the word "return"?
Matei Zaharia: (35:43) So both things are exciting. The thing we saw going in is that these conversational models especially, the ones you can just kind of talk to and prompt and get to do something, were limited to large providers, like OpenAI and Google to some extent. And everyone thought, oh, I have to send all my data to this external vendor, they're gonna see what I'm doing. Basically, I have to take a dependency on this service that sees extremely private data to build these features. And so we wanted to show people that you can do this yourself, and to accelerate the development of open source conversational models. So with Dolly, we took some open source models that had been released. They're not as good overall as something like ChatGPT in particular.
Matei Zaharia: (36:30) They're smaller, so they don't have as much broad world knowledge, but we showed that you can make them into what are called instruction-following models, where you can just tell the model what you want it to do in words and it will actually do that thing, without a ton of effort. And one of the most fun things we did as part of that: people had shown that you can train these models to become conversational by showing them a bunch of conversations, by training them on it. Not too surprising, I guess, but everyone was using outputs from ChatGPT to train them, and OpenAI's terms of use say you can't use their outputs to build models that compete with OpenAI. So that was a big problem. You could do it in research, you could write a paper, but no one wanted to do it commercially.
Matei Zaharia: (37:16) So we realized we have about 5,000 employees; we can just ask all of them to write a few examples: here's a message you'd like to send to a chatbot, and here's what you'd like it to answer. And we created this dataset of 15,000 conversations, essentially, that you can use for training, and it turned out to be pretty good. It turns out to give you similar results to using the ChatGPT outputs or many other things out there.
So it's just the start of our work in LLMs. As with other stuff, our hope is to help democratize this and help every company feel like they can own their data, own their models, build their own advantage in this space, and really exploit this stuff on their own as they see fit for their business.
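For anyone who wants to poke at this, here is a minimal sketch of loading the published dataset and one of the smaller Dolly models from Hugging Face. The prompt is a made-up example of the enterprise-style classification discussed in this conversation; treat model sizes and behavior as illustrative.

```python
# A minimal sketch of trying Dolly locally, using the published
# databricks-dolly-15k dataset and a smaller dolly-v2 model from
# Hugging Face. The prompt is a made-up example.
import torch
from datasets import load_dataset
from transformers import pipeline

# The instruction-tuning data: ~15,000 employee-written examples with
# instruction / context / response / category fields.
dolly_15k = load_dataset("databricks/databricks-dolly-15k", split="train")
print(dolly_15k[0])

# dolly-v2 ships its own instruction-following pipeline code, hence
# trust_remote_code=True.
generate = pipeline(
    model="databricks/dolly-v2-3b",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)

# An enterprise-style task: classify a piece of customer feedback.
print(generate(
    "Is this customer message about battery life or about initiating a "
    "return? Message: 'The charge barely lasts two hours now.'"
))
```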
Satyen Sangani: (38:01) Do you ultimately see Dolly as a potential competitor to what OpenAI offers?
Matei Zaharia: (38:06) So, it's a great question. Just to be clear, Dolly is surprisingly good at conversation. We were surprised, 'cause we used these open source models that hadn't been trained on a lot of data, and on their own they were kind of just outputting gibberish for most tasks; you definitely couldn't just instruct them or prompt them to do a thing. And we gave them a little bit of instruction training with this dataset and suddenly they're pretty good. They can generate all kinds of stuff, but it's not state of the art for some of the things you would do in GPT-4 or ChatGPT. So don't expect it to just replace that. But the way it does compete, and I think this is really important, is that if you have a more limited domain, which I think most enterprise applications do, then it can be really good in that domain.
Matei Zaharia: (38:52) The main thing it's not as good at is broad world knowledge. We released different sizes of Dolly models based on different open source ones, and you can try stuff like: if I ask it who Ali Ghodsi, our CEO, is, maybe the biggest model knows about Ali, and the smallest model doesn't; it just makes up something that sounds plausible. If you ask it some question about chemistry or something, again, the bigger models know more, the smaller ones know less. But if you just want to do conversation, if you want to do stuff based on general knowledge, like, I got this piece of customer feedback from someone, and is it about the battery life of my product or is it about initiating a return? It'll do that just as well.
Matei Zaharia: (39:36) And I also think that in most domains, if you tune it on enterprise data, on your jargon and your concepts, that's a way smaller amount of stuff it has to learn than literally everything on the internet, and it'll do pretty well. And we're not the only ones doing this. If you look at, for example, code completion models, there's this really nice one built by Replit recently, and there were others before. All of the ones you use as coding assistants are quite small in terms of number of parameters, very fast, very small. They're not GPT-3 or GPT-4 sized.
So, I think there's a lot of research to be done on how many parameters you need for different applications. But I think there are lots of applications where something like Dolly works well, and we're starting to see that with our customers. We have lots of customers that have used it, have tuned it for specific things, and they're getting things done, and they can fully own the infrastructure for that.
Satyen Sangani: (40:27) Yeah, it's really interesting to see the world of this approximately Turing-test-passing, generalized AI on one side and all of the specialized models on the other, and to see where those intersections will occur, whether one will entail the other, or how the lines will get drawn. It's fun to watch and fun to see.
Tell us about the adoption of Dolly to date. What are you hearing from users? Has the traction been better, worse than you'd expect?
Matei Zaharia: (40:49) It's been definitely better than we expected. One of the best things about it: in the open source world, there are many other groups, many other companies interested in democratizing language models and making things work with them. So we've seen a lot of really cool things built on it in that world, and it's kind of become the default if you want a commercially usable LLM that's not encumbered by any weird terms of service from OpenAI.
Just a simple one I saw yesterday: someone released this pretty popular project for how to get a language model to always return answers as JSON. Which you can imagine for this analysis use case: I wanna read a bunch of documents in my company and extract structured data from them, but I always want to extract it with the same schema.
Matei Zaharia: (41:40) Don't make up a schema; just tell me, is it a return? What product is it? From what country? I give you a schema. And someone came up with a trick to get the model to do that by only filling in the blanks in an existing skeleton of an output, and they used Dolly as the example, so you can just use that everywhere. So it's been awesome to see this. And with customers, we've seen everything from retailers, restaurants, insurance companies, and financial companies using it and prototyping things. We've seen healthcare companies doing local stuff that's supposed to run on your phone, with very private data that you never wanna send out. So, quite a bit of use.
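The project described sounds like jsonformer, which constrains generation to fill only the value slots of a fixed JSON skeleton. Here is a hedged sketch based on its published usage, with a hypothetical schema and prompt and a Dolly model.

```python
# Schema-constrained extraction: the model only fills in the blanks of a
# fixed JSON skeleton, so every document yields the same fields. Based on
# jsonformer's published usage; the schema and prompt are hypothetical.
from transformers import AutoModelForCausalLM, AutoTokenizer
from jsonformer import Jsonformer

model_name = "databricks/dolly-v2-3b"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Always extract these fields; never let the model invent a schema.
schema = {
    "type": "object",
    "properties": {
        "is_return": {"type": "boolean"},
        "product": {"type": "string"},
        "country": {"type": "string"},
    },
}

prompt = ("Extract the details from this support message: "
          "'Please take back the blender I bought; I'm in Canada.'")
result = Jsonformer(model, tokenizer, schema, prompt)()
print(result)  # e.g. {'is_return': True, 'product': 'blender', 'country': 'Canada'}
```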
Matei Zaharia: (42:24) And as I said, this is sort of the beginning. I think the research community and the open source community will produce even better models. We're also looking at certain aspects of it; we're really interested, among the common use cases our customers have, in how to make those easier and how to package those up into a solution. And I think you'll see some really cool stuff coming out.
Satyen Sangani: (42:47) Yeah, it'll be interesting to watch, because on one level, one of the value propositions is being able to preserve your IP and not have to deal with these terms of service. And on the other hand, an open source model would allow for contribution back. So how people think about what they contribute back and what they keep specialized should be fun and interesting for you to see.
Matei Zaharia: (43:03) I think many companies are now viewing their data as an even more valuable asset than before. Lots of companies that made stuff crawlable on the web, like Reddit, are turning that off and saying, "You can't use this to train ML models, 'cause we wanna sell it to you for that." We think many companies will build proprietary things. The open source model is a good basis to start from, but I think everyone will then tune it for their domain, get something better, and figure out how to use that: use it in their product, maybe resell it somehow, whatever it is that they do.
Satyen Sangani: (43:36) Absolutely. So, maybe two more questions before I let you go. The first is on culture within Databricks. You had a lot of open sourcing of the knowledge that got Dolly off the ground, and we on this podcast talk a lot about data culture, about what it means to make organization-wide, data-based decisions. How would you describe the data culture at Databricks? It's an enterprise company with lots of academics, lots of innovation, open source, so it's an interesting playground for construction.
Matei Zaharia: (44:01) Yeah, we are a large company. We have many of the things you'd see at a public company; even though we're not public, we try to operate internally as if we were, 'cause we're getting close to being public at some point. So we have a lot of those same concerns.
Basically, the culture is, as much as possible, we try to be very transparent internally, all the way to Ali showing the whole company our board deck every time we have a board meeting. And we try to let people work with data and discover things on their own while keeping things private and secure. So we try to build all these versions of tables and datasets that anyone can work with, and to figure out lightweight processes for people to request a new thing, or to build their own, or to create their own datasets and work with those.
Matei Zaharia: (44:52) And we have tens of thousands of dashboards and tables and things like that built internally that people use. We also have restricted domains, as you can imagine many other companies would, for things like finance and HR. And of course, we heavily use our own product, and we act as the first customer of things like Unity Catalog. But yeah, our philosophy has been: let people look around and let people try to build things. We focus very significantly on letting users do their own thing and then asking the data team to operationalize the things that are extremely valuable.
Satyen Sangani: (45:28) Yeah, so transparency from the top and open and available information everywhere.
Matei Zaharia: (45:34) Yeah, as much as possible. Obviously, there are a lot of things you can't just look at as an employee, but yeah.
Satyen Sangani: (45:40) So, you're a Stanford professor and you're also the chief technologist at a company that might be one of the biggest companies in data. That feels like a couple of different hard jobs. How do you allocate and balance your time, and how do you define yourself in the context of having so many competing demands on your time?
Matei Zaharia: (45:55) I started out in academia and research, and since I had the opportunity to do that as a professor, I wanted to see what it's like and what I could do there. So I decided to split my time between them after the first two years of Databricks, when I was there full-time.
The only reason it's possible is that we have an amazing team overall, so I can still add a lot of value without having to be there all the time. All of engineering reports to our SVP of engineering, who's not me, so I don't have to do all the day-to-day management stuff you can imagine there. Usually, I work on a few products where I'm involved in detail at Databricks, and also on overall company roadmap and strategy, looking across all the things we're doing and making sure that they fit well together and make sense. And in terms of how I actually split my time, I try to have days that are fully dedicated to either Databricks or Stanford stuff.
Matei Zaharia: (46:53) And one thing is, as a professor, it is a job where you're supposed to be able to do other stuff. Universities generally let you do one day per week of external work during the school year, and then in the summer they let you do basically whatever you want. So it gives you the opportunity to look outside and do things in the wider world beyond the university. It's not completely at odds with that.
But yeah, I've tried to pick things so that I get to explore and research up-and-coming things at Stanford, like the work I was doing with LLMs before they became super popular. Maybe I learn some stuff that will eventually be useful for Databricks, and at Databricks I can focus on the things I'm particularly good at.
Satyen Sangani: (47:38) It's incredible, and it sounds pretty gratifying. So, great to meet you, and great to have you take the time. I think everybody listening to this will appreciate the insight and the thoughts about what you've built. So, I look forward to building the relationship and to getting you back on the podcast in short order.
Matei Zaharia: (47:55) Yeah, awesome chatting. It was a lot of fun, and it's been awesome working with you guys so far. This is a super exciting time for data in general. And I always tell people, I think we're still at an early stage; we're still figuring out the right interfaces for everything, the right ways to really democratize this within an enterprise, and so on.
[music]
Satyen Sangani: (48:21) From starting the Spark project to co-founding Databricks, Matei has remained focused on a core problem statement: how do we make data and analytics accessible to more people?
Building a platform like Databricks is a lot like constructing a cathedral. It must be elegant, consistent, and robust. However, it also has to accommodate a bazaar of use cases, ranging from business intelligence to machine learning to AI.
If there's one thing to take away from this conversation, it's that staying authentic to your vision pays dividends in the long run. Sometimes that means building a rock-solid user experience, and other times it means building an open platform that allows others to showcase what they can do.
Thank you for listening to this episode, and thank you, Matei, for joining. I'm your host, Satyen Sangani, CEO of Alation. And Data Radicals, stay the course, keep learning and sharing. Until next time.
Producer 2: (49:11) This podcast is brought to you by Alation. Let's meet up at the Databricks Summit this summer! We'll reveal how Alation data intelligence is key to your data lakehouse success. Get a firsthand look into how top organizations are simplifying cloud complexity with Alation and Databricks. The Data and AI Summit runs from June 26th to the 29th in San Francisco. We can't wait to connect. Learn more at databricks.com/dataaisummit.